This notebook relates to the TensorFlow Speech Commands dataset, a set of one-second .wav audio files, each containing a single spoken English word. The words are drawn from a small set of commands and are spoken by a variety of speakers. The dataset was designed for limited-vocabulary speech recognition tasks and can be obtained for free from the IBM Developer Data Asset Exchange.
In this notebook, we download the dataset archive from cloud storage, extract it, explore the dataset, and import audio samples into our Watson Studio project.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see a cell like the one above, follow these steps to enable the notebook to access the dataset from the project's resources:
Click More -> Insert project token in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the notebook execution below.
import requests
import os
import tarfile
from pathlib import Path
from urllib.parse import urlparse
import glob
import IPython.display as ipd
from IPython.display import Markdown, display
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
def printmd(string):
    display(Markdown(string))
First, we download the TensorFlow Speech Commands data set archive from the Data Asset Exchange cloud storage and extract the data files.
# Dataset archive location on public cloud storage
fname = 'tensorflow-speech-commands.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-tensorflow-speech-commands/1.0.1/'
data_path = 'TensorFlow-Speech-Commands'
filenames = ['on/0a7c2a8d_nohash_0.wav', 'off/0ab3b47d_nohash_0.wav', 'up/0a7c2a8d_nohash_0.wav', 'bird/0a7c2a8d_nohash_0.wav', 'bird/0c2ca723_nohash_1.wav',
'sheila/00f0204f_nohash_1.wav', 'cat/0ab3b47d_nohash_0.wav', 'dog/0b09edd3_nohash_1.wav', 'right/0a7c2a8d_nohash_0.wav',
'bird/0b77ee66_nohash_0.wav', 'bird/0eb48e10_nohash_1.wav', 'bird/0fa1e7a9_nohash_0.wav', 'bird/1d919a90_nohash_2.wav', 'zero/0c40e715_nohash_0.wav']
download_link = url + fname
Download and extract the dataset archive.
print('Downloading dataset archive {} ...'.format(download_link))
r = requests.get(download_link)
if r.status_code != 200:
    print('Error. Dataset archive download failed.')
else:
    # Save the downloaded archive
    print('Saving downloaded archive as {} ...'.format(fname))
    with open(fname, 'wb') as downloaded_file:
        downloaded_file.write(r.content)
    if tarfile.is_tarfile(fname):
        # Extract the downloaded archive
        print('Extracting downloaded archive ...')
        with tarfile.open(fname, 'r') as tar:
            tar.extractall()
        print('Removing downloaded archive ...')
        Path(fname).unlink()
        print('Done.')
    else:
        print('Error. The downloaded file is not a valid TAR archive.')
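Note that `r.content` buffers the entire archive in memory before it is written to disk. For large archives, a streamed download that writes fixed-size chunks is gentler on memory. The helper below is a minimal sketch of the chunk-writing part (`save_stream` is a hypothetical name, not part of the notebook); with `requests` it would be fed by `r.iter_content()` on a response opened with `stream=True`.

```python
def save_stream(chunks, path):
    """Write an iterable of byte chunks to disk without buffering the whole payload."""
    written = 0
    with open(path, 'wb') as out:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                out.write(chunk)
                written += len(chunk)
    return written

# With requests, this would be driven by a streamed response, e.g.:
# r = requests.get(download_link, stream=True)
# save_stream(r.iter_content(chunk_size=1 << 20), fname)
```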
In this section, we inspect the TensorFlow Speech Commands dataset after download and extraction.
The dataset contains 31 audio folders: 30 word folders plus a collection of background noise audio files. Of the 30 words, 20 are core command words and 10 are auxiliary words that can be used to test whether an algorithm correctly ignores speech that does not contain trigger words. The audio clips were originally collected by Google and recorded by volunteers in uncontrolled locations around the world.
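Since each clip is described as a one-second recording, we can sanity-check an extracted file with Python's standard-library `wave` module. The helper below is a small sketch (`clip_info` is a hypothetical name); the commented call assumes the archive has already been extracted to `data_path`.

```python
import wave

def clip_info(path):
    """Return (sample_rate, channels, duration_seconds) for a .wav file."""
    with wave.open(path, 'rb') as w:
        rate = w.getframerate()
        return rate, w.getnchannels(), w.getnframes() / rate

# Example (assumes extraction has completed):
# rate, channels, duration = clip_info(os.path.join(data_path, filenames[0]))
```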
# Get the audio folders (skip metadata files such as README.md and LICENSE)
folders = sorted(f for f in glob.glob(data_path + '/*') if os.path.isdir(f))
# Save audio sample labels
labels = [os.path.basename(f) for f in folders]
# Number of samples in each category of audio clip
recordings = []
for folder in folders:
    samples = [f for f in os.listdir(folder) if f.endswith('.wav')]
    recordings.append(len(samples))
printmd('**Core words and number of samples from audio samples:**')
print(list(zip(labels, recordings)))
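If a quick ranking is more useful than the raw pairs, the labels and counts can be sorted together. This is a small sketch (`count_table` is a hypothetical helper, not part of the notebook):

```python
def count_table(labels, recordings):
    """Pair labels with their sample counts, largest first."""
    return sorted(zip(labels, recordings), key=lambda p: p[1], reverse=True)

# count_table(['on', 'off', 'bird'], [10, 30, 20])
```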
The list above makes it hard to compare sample counts across folders, so let's visualize the distribution of audio samples instead.
# Plot the number of recordings per label
trace = go.Bar(
x=labels,
y=recordings,
marker=dict(color = recordings),
text = recordings,
textposition='outside'
)
layout = go.Layout(
title='Number of recordings in given label',
xaxis = dict(title='Words'),
yaxis = dict(title='Number of recordings')
)
py.iplot(go.Figure(data=[trace], layout=layout))
# Play audio - sample 1
printmd('**Core word** - ' + filenames[0][0:2] )
printmd('**Speaker** - ' + filenames[0][3:])
ipd.Audio(os.path.join(data_path, filenames[0]))
# Play audio - sample 2
printmd('**Core word** - ' + filenames[1][0:3] )
printmd('**Speaker** - ' + filenames[1][4:])
ipd.Audio(os.path.join(data_path, filenames[1]))
# Play audio - sample 3
printmd('**Core word** - ' + filenames[2][0:2] )
printmd('**Speaker** - ' + filenames[2][3:])
ipd.Audio(os.path.join(data_path, filenames[2]))
# Play audio - sample 4
printmd('**Auxiliary word** - ' + filenames[3][0:4])
printmd('**Speaker** - ' + filenames[3][5:])
ipd.Audio(os.path.join(data_path, filenames[3]))
# Play audio - sample 5, another bird sound file
printmd('**Auxiliary word** - ' + filenames[4][0:4])
printmd('**Speaker** - ' + filenames[4][5:])
ipd.Audio(os.path.join(data_path, filenames[4]))
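The cells above slice each path with hard-coded offsets that must match the word's length, which is easy to get wrong. A sketch of a more robust alternative splits on the path separator instead (`label_and_speaker` is a hypothetical helper; the dataset paths use forward slashes, so `posixpath` is used deliberately):

```python
import posixpath

def label_and_speaker(relpath):
    """Split 'word/speaker_file.wav' into its label and speaker file name."""
    label, speaker = posixpath.split(relpath)
    return label, speaker

# label_and_speaker('bird/0a7c2a8d_nohash_0.wav')
```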
Next, we add the extracted data files to the Watson Studio project to make them available to the other notebooks.
# Verify that the extracted artifacts are located in the expected location
if not Path(data_path).exists():
    print('Error. The extracted data files are not located in the {} directory.'.format(data_path))
else:
    # Save extracted data file(s) as project assets
    data_asset_count = 0
    for file in filenames:
        # Save the data file as a data asset in the project
        with open(data_path + '/' + file, 'rb') as f:
            asset_name = file.replace('/', '_')
            print(asset_name)
            asset_name = asset_name.split('.')[0]
            print('Saving as {}.wav to project data asset ...'.format(asset_name))
            project.save_data(asset_name + '.wav', f.read(), set_project_asset=True, overwrite=True)
            data_asset_count = data_asset_count + 1
    print('Number of added data assets: {}'.format(data_asset_count))
    print('You are ready to run the other notebooks.')
See the Part 2 - Dataset Visualization notebook to learn more about the data.
This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.